 control module




CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion

Ren, Maoye, Vaddamanu, Praneetha, Xu, Jianjin, Frade, Fernando De la Torre

arXiv.org Artificial Intelligence

Recently, remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches suffer from degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face-swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precisely controlled personalization and eliminating the need to retrain a controller for each model. In addition, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing the inversion time by a factor of 129. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of ResInversion over NTI. The code will be made publicly available upon paper acceptance.


Cross-Modality Controlled Molecule Generation with Diffusion Language Model

Zhang, Yunzhe, Wang, Yifei, Nguyen, Khanh Vinh, Hong, Pengyu

arXiv.org Artificial Intelligence

They inject conditioning signals at the start of the training process and require retraining a new model from scratch whenever the constraint changes. However, real-world applications often involve multiple constraints across different modalities, and additional constraints may emerge over the course of a study. This raises a challenge: how to extend a pre-trained diffusion model not only to support cross-modality constraints but also to incorporate new ones without retraining. To tackle this problem, we propose the Cross-Modality Controlled Molecule Generation with Diffusion Language Model (CMCM-DLM), demonstrated on two distinct modalities: molecular structure and chemical properties. Our approach builds upon a pre-trained diffusion model, incorporates two trainable modules, the Structure Control Module (SCM) and the Property Control Module (PCM), and operates in two distinct phases during the generation process. In Phase I, we employ the SCM to inject structural constraints during the early diffusion steps, effectively anchoring the molecular backbone. Phase II builds on this by further introducing the PCM to guide the later stages of inference and refine the generated molecules, ensuring that their chemical properties match the specified targets. Experimental results on multiple datasets demonstrate the efficiency and adaptability of our approach, highlighting CMCM-DLM's significant advancement in molecular generation for drug discovery applications.
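The two-phase scheme described in this abstract can be pictured as a step-dependent schedule over the denoising loop. The sketch below is only an illustration under stated assumptions: the function name `active_modules`, the `phase_split` fraction, and the choice to keep the SCM active during Phase II are all hypothetical and are not taken from the paper.

```python
from typing import List


def active_modules(step: int, total_steps: int, phase_split: float = 0.5) -> List[str]:
    """Return the control modules applied at a given denoising step.

    Steps count from 0 (first, noisiest step) to total_steps - 1.
    phase_split is a hypothetical fraction of the steps assigned to Phase I.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    boundary = int(total_steps * phase_split)
    if step < boundary:
        # Phase I: only structural constraints, anchoring the backbone.
        return ["SCM"]
    # Phase II (assumption): property guidance is added on top of the SCM.
    return ["SCM", "PCM"]


if __name__ == "__main__":
    for s in range(10):
        print(s, active_modules(s, total_steps=10))
```

With ten steps and the default split, steps 0 to 4 use only the SCM and steps 5 to 9 add the PCM; the actual split point used by the authors is not stated in the abstract.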


Voice Impression Control in Zero-Shot TTS

Fujita, Keinichi, Horiguchi, Shota, Ijima, Yusuke

arXiv.org Artificial Intelligence

Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.


Design and Implementation of a Peer-to-Peer Communication, Modular and Decentral YellowCube UUV

Xu, Zhizun, Jia, Baozhu, Shi, Weichao

arXiv.org Artificial Intelligence

Unmanned Underwater Vehicles (UUVs) are pivotal tools for offshore engineering and oceanographic research. Most existing UUVs do not facilitate easy integration of new or upgraded sensors. A solution to this problem is a modular UUV system with changeable payload sections capable of carrying different sensors to suit different missions. The design and implementation of a modular and decentralised UUV named YellowCube is presented in this paper. Instead of the centralised software architecture adopted by other modular underwater vehicle designs, a Peer-to-Peer (P2P) communication mechanism is implemented among the UUV's modules. Laboratory experiments and sea trials have been carried out to verify the performance of the UUV. Over the past few decades, UUVs have become essential tools in offshore engineering and ocean research. Their tasks range from offshore engineering and oceanographic research to salvage and rescue and military monitoring.


SGN-CIRL: Scene Graph-based Navigation with Curriculum, Imitation, and Reinforcement Learning

Oskolkov, Nikita, Zhang, Huzhenyu, Makarov, Dmitry, Yudin, Dmitry, Panov, Aleksandr

arXiv.org Artificial Intelligence

The 3D scene graph models spatial relationships between objects, enabling the agent to navigate efficiently in a partially observable environment and predict the location of the target object. This paper proposes an original framework named SGN-CIRL (3D Scene Graph-Based Reinforcement Learning Navigation) for mapless reinforcement learning-based robot navigation with a learnable representation of an open-vocabulary 3D scene graph. To accelerate and stabilize the training of reinforcement learning-based algorithms, the framework also employs imitation learning and curriculum learning. The former enables the agent to learn from demonstrations, while the latter structures the training process by gradually increasing task complexity from simple to more advanced scenarios. Numerical experiments conducted in the Isaac Sim environment showed that using a 3D scene graph for reinforcement learning significantly increased the success rate in difficult navigation cases. The code is open-sourced and available at: https://github.com/Xisonik/Aloha
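To make the curriculum idea concrete, here is a minimal, generic sketch of a success-rate-gated curriculum, which is one common way to increase task complexity gradually. It is not the authors' implementation: the class name `Curriculum`, the threshold, and the window size are hypothetical choices for illustration.

```python
from collections import deque


class Curriculum:
    """Advance to a harder level once the recent success rate is high enough."""

    def __init__(self, num_levels: int, threshold: float = 0.8, window: int = 10):
        self.level = 0                      # current difficulty level, 0 = easiest
        self.num_levels = num_levels
        self.threshold = threshold          # hypothetical promotion threshold
        self.recent = deque(maxlen=window)  # outcomes of the most recent episodes

    def record(self, success: bool) -> int:
        """Store one episode outcome and return the (possibly updated) level."""
        self.recent.append(success)
        window_full = len(self.recent) == self.recent.maxlen
        rate = sum(self.recent) / len(self.recent)
        if window_full and rate >= self.threshold and self.level < self.num_levels - 1:
            self.level += 1
            self.recent.clear()  # start a fresh window at the new level
        return self.level


if __name__ == "__main__":
    c = Curriculum(num_levels=3, threshold=0.8, window=5)
    print([c.record(True) for _ in range(5)])
```

Gating on a full window avoids promoting the agent after a single lucky episode; a real training setup might instead use a fixed step schedule.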


A Neural Network Mode for PX4 on Embedded Flight Controllers

Hegre, Sindre M., Rehberg, Welf, Kulkarni, Mihir, Alexis, Kostas

arXiv.org Artificial Intelligence

This paper contributes an open-sourced implementation of a neural-network-based controller framework within the PX4 stack. We develop a custom module for inference on the microcontroller while retaining all of the functionality of the PX4 autopilot. Policies trained in the Aerial Gym Simulator are converted to the TensorFlow Lite format and then built together with PX4 and flashed to the flight controller. The policies substitute the control cascade within PX4 to offer an end-to-end position-setpoint tracking controller that directly provides normalized motor RPM setpoints. Experiments conducted in simulation and in the real world show similar tracking performance. We thus provide a flight-ready pipeline for testing neural control policies in the real world. The pipeline simplifies the deployment of neural networks on embedded flight controller hardware, thereby accelerating research on learning-based control. Both the Aerial Gym Simulator and the PX4 module are open-sourced at https://github.com/ntnu-arl/aerial_gym_simulator and https://github.com/SindreMHegre/PX4-Autopilot-public/tree/for_paper. Video: https://youtu.be/lY1OKz_UOqM?si=VtzL243BAY3lblTJ.


ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

Zhou, Xukun, Li, Fengxin, Chen, Ming, Zhou, Yan, Wan, Pengfei, Zhang, Di, Jin, Yeying, Fan, Zhaoxin, Liu, Hongyan, He, Jun

arXiv.org Artificial Intelligence

Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) Motion Base Construction, which builds a gesture library from the training dataset; (2) a Motion Retrieval Module, which employs contrastive learning and momentum distillation to retrieve fine-grained reference poses; and (3) a Precision Control Module, which integrates partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2% and improves motion diversity by 5.3% over EMAGE, with user studies revealing a 71.3% preference for its naturalness and semantic relevance. Code will be released upon acceptance.


HALO: Fault-Tolerant Safety Architecture For High-Speed Autonomous Racing

Harder, Aron, Kulkarni, Amar, Behl, Madhur

arXiv.org Artificial Intelligence

The field of high-speed autonomous racing has seen significant advances in recent years, with the rise of competitions such as RoboRace and the Indy Autonomous Challenge providing a platform for researchers to develop software stacks for autonomous race vehicles capable of reaching speeds in excess of 170 mph. Ensuring the safety of these vehicles requires the software to continuously monitor for different faults and erroneous operating conditions during high-speed operation, with the goal of mitigating any unreasonable risks posed by malfunctions in sub-systems and components. This paper presents a comprehensive overview of the HALO safety architecture, which has been implemented on a full-scale autonomous racing vehicle as part of the Indy Autonomous Challenge. The paper begins with a failure mode and criticality analysis of the perception, planning, control, and communication modules of the software stack. Specifically, we examine three different types of faults - node health, data health, and behavioral-safety faults. To mitigate these faults, the paper then outlines HALO safety archetypes and runtime monitoring methods. Finally, the paper demonstrates the effectiveness of the HALO safety architecture for each of the faults, through real-world data gathered from autonomous racing vehicle trials during multi-agent scenarios.
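The abstract names node-health and data-health faults without detailing how they are detected. As a generic illustration only, and not the HALO implementation, the sketch below shows two common runtime checks: a heartbeat timeout for node health and a finite-and-in-range test for data health. All names (`NodeHeartbeat`, `check_node_health`, `check_data_health`) and the timeout values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeHeartbeat:
    last_seen: float  # timestamp (seconds) of the most recent heartbeat
    timeout: float    # hypothetical maximum allowed silence, in seconds


def check_node_health(nodes: Dict[str, NodeHeartbeat], now: float) -> List[str]:
    """Return the sorted names of nodes whose heartbeat is older than their timeout."""
    return sorted(
        name for name, hb in nodes.items() if now - hb.last_seen > hb.timeout
    )


def check_data_health(value: float, low: float, high: float) -> bool:
    """Return True when a reading is finite and lies within [low, high]."""
    return math.isfinite(value) and low <= value <= high


if __name__ == "__main__":
    nodes = {
        "perception": NodeHeartbeat(last_seen=9.95, timeout=0.1),
        "planning": NodeHeartbeat(last_seen=9.0, timeout=0.5),
    }
    print(check_node_health(nodes, now=10.0))
    print(check_data_health(float("nan"), 0.0, 100.0))
```

A behavioral-safety check would sit on top of these, comparing the vehicle's actual behavior against what the planner intended; that layer is omitted here because the abstract gives no detail about it.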